Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")This project analyzed flight delay data to identify patterns in airport performance and delay causes. SFO had the highest overall delay rate, and the best months to fly with fewer delays were September through November. We also calculated a new metric for total weather-related delays to better understand their impact.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
We replaced missing values with consistent NaN values using df.replace(). Below is one example row showing the airport_name missing value represented as NaN.
# Include and execute your code here
missing_values = ["", " ", "NA", "NULL", "null", ".", -999]
df.replace(missing_values, np.nan, inplace=True)
print(df.iloc[2])airport_code IAD
airport_name NaN
month January
year 2005.0
num_of_flights_total 12381
num_of_delays_carrier 414
num_of_delays_late_aircraft 1058.0
num_of_delays_nas 895
num_of_delays_security 4
num_of_delays_weather 61
num_of_delays_total 2430
minutes_delayed_carrier NaN
minutes_delayed_late_aircraft 70919
minutes_delayed_nas 35660.0
minutes_delayed_security 208
minutes_delayed_weather 4497
minutes_delayed_total 134881
Name: 2, dtype: object
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
ORD has the worst delays. I used delay_proportion and avg_delay_hours to compare airports. ORD had the highest avg_delay_hours (7115.67) and second highest delay_proportion (0.23).
# Include and execute your code here
df_clean = df.dropna(subset=[
"num_of_flights_total",
"num_of_delays_total",
"minutes_delayed_total"
])
# Convert delay columns to numeric just in case any were stored as strings
df_clean["num_of_flights_total"] = pd.to_numeric(df_clean["num_of_flights_total"])
df_clean["num_of_delays_total"] = pd.to_numeric(df_clean["num_of_delays_total"])
df_clean["minutes_delayed_total"] = pd.to_numeric(df_clean["minutes_delayed_total"])
# Group by airport and calculate summary metrics
summary = df_clean.groupby("airport_code").agg(
total_flights=("num_of_flights_total", "sum"),
total_delays=("num_of_delays_total", "sum"),
avg_delay_minutes=("minutes_delayed_total", "mean")
).reset_index()
# Calculate proportions and convert minutes to hours
summary["delay_proportion"] = summary["total_delays"] / summary["total_flights"]
summary["avg_delay_hours"] = summary["avg_delay_minutes"] / 60
# Display summary table with selected columns
summary[["airport_code", "total_flights", "total_delays", "delay_proportion", "avg_delay_hours"]].round(2)| airport_code | total_flights | total_delays | delay_proportion | avg_delay_hours | |
|---|---|---|---|---|---|
| 0 | ATL | 4430047 | 902443 | 0.20 | 6816.15 |
| 1 | DEN | 2513974 | 468519 | 0.19 | 3178.46 |
| 2 | IAD | 851571 | 168467 | 0.20 | 1298.42 |
| 3 | ORD | 3597588 | 830825 | 0.23 | 7115.67 |
| 4 | SAN | 917862 | 175132 | 0.19 | 1044.98 |
| 5 | SFO | 1630945 | 425604 | 0.26 | 3352.33 |
| 6 | SLC | 1403384 | 205160 | 0.15 | 1278.20 |
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I used delay_proportion to show that the best months to avoid delays are September, October, and November.
# Include and execute your code here
# Filter out rows where 'month' is missing
df_month = df[df['month'].notna()]
# Group by month
monthly_stats = df_month.groupby('month').agg({
'num_of_delays_total': 'sum',
'num_of_flights_total': 'sum'
}).reset_index()
# Calculate delay proportion
monthly_stats['delay_proportion'] = monthly_stats['num_of_delays_total'] / monthly_stats['num_of_flights_total']
# Order months (not alphabetically)
month_order = [
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
]
monthly_stats = monthly_stats[monthly_stats['month'].isin(month_order)]
monthly_stats['month'] = pd.Categorical(monthly_stats['month'], categories=month_order, ordered=True)
monthly_stats = monthly_stats.sort_values('month')
monthly_stats['month'] = monthly_stats['month'].astype(str) # convert to string to avoid casting issues
# Line chart
ggplot(monthly_stats, aes(x='month', y='delay_proportion')) + \
geom_line(size=1.5) + \
geom_point(size=3) + \
ggtitle('Proportion of Flight Delays by Month') + \
xlab('Month') + \
ylab('Proportion of Delayed Flights') + \
theme(axis_text_x=element_text(angle=45, hjust=1))According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
January had the most delays due to weather. ORD having the highest total delays due to weather.
# Include and execute your code here
late_aircraft_mean = df['num_of_delays_late_aircraft'].mean()
df['num_of_delays_late_aircraft'].fillna(late_aircraft_mean, inplace=True)# Include and execute your code here
# a
weather_severe = df['num_of_delays_weather']
# b
weather_late = df['num_of_delays_late_aircraft'] * 0.30
# c
weather_months = ['April', 'May', 'June', 'July', 'August']
weather_nas = np.where(
df['month'].isin(weather_months),
df['num_of_delays_nas'] * 0.40,
df['num_of_delays_nas'] * 0.65
)# Include and execute your code here
df['weather_total'] = weather_severe + weather_late + weather_nas
# First five rows
df[['airport_code', 'month', 'num_of_delays_weather', 'num_of_delays_late_aircraft', 'num_of_delays_nas', 'weather_total']].head()| airport_code | month | num_of_delays_weather | num_of_delays_late_aircraft | num_of_delays_nas | weather_total | |
|---|---|---|---|---|---|---|
| 0 | ATL | January | 448 | 1109.104072 | 4598 | 3769.431222 |
| 1 | DEN | January | 233 | 928.000000 | 935 | 1119.150000 |
| 2 | IAD | January | 61 | 1058.000000 | 895 | 960.150000 |
| 3 | ORD | January | 306 | 2255.000000 | 5415 | 4502.250000 |
| 4 | SAN | January | 56 | 680.000000 | 638 | 674.700000 |
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
SFO had the highest proportion of weather-delayed flights, followed by ORD.
# Include and execute your code here
# Calculate total weather delay proportion for each airport
df_weather = df.groupby('airport_code').agg({
'weather_total': 'sum',
'num_of_flights_total': 'sum'
}).reset_index()
df_weather['weather_delay_proportion'] = df_weather['weather_total'] / df_weather['num_of_flights_total']
# Create a barplot
ggplot(df_weather, aes(x='airport_code', y='weather_delay_proportion')) + \
geom_bar(stat='identity') + \
labs(
title='Proportion of Flights Delayed by Weather per Airport',
x='Airport',
y='Weather Delay Proportion'
)Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
type your results and analysis here
# Include and execute your code here